(54) Method and signal processing device for converting stereo signals for headphone listening



(57) The invention relates to a method for converting
signals in two-channel stereo format to become suitable
to be played back using headphones. The invention also
relates to a signal processing device for carrying out
said method. According to the invention, left direct path
(L d ) and left cross-talk path (L x ) signals are formed from
the left input signal (L in ), and correspondingly right direct
path (R d ) and right cross-talk path (R x ) signals are
formed from the right input signal (R in ). Further, the
left output signal (L out ) is formed by combining said left
direct path (L d ) and said right cross-talk path (R x )
signals, and correspondingly the right output signal (R out )
is formed by combining said right direct path (R d ) and
said left cross-talk path (L x ) signals. The direct path
signals (L d ,R d ) are each formed using filtering (1,3)
associated with a first frequency dependent gain (G d ), and
the cross-talk path signals (L x ,R x ) are each formed using
filtering (2,4) associated with a second frequency dependent
gain (G x ) and by adding an interaural time difference
(ITD) (5,6).
Description

[0001] The present invention relates to a method ac- 
cording to the preamble of the appended claim 1 for con- 
verting signals in two-channel stereo format to become 
suitable to be played back using headphones. The in- 
vention also relates to a signal processing device ac- 
cording to the preamble of the appended claim 7 for car- 
rying out said method. 

[0002] Already for several decades the prevailing for- 
mat for making music and other audio recordings and 
public broadcasts has been the well-known two-channel 
stereo format. The two-channel stereo format consists 
of two independent tracks or channels; the left (L) and 
the right channel, which are intended for playback using 
two separate loudspeaker units. Said channels are 
mixed and/or recorded and/or otherwise prepared to 
provide a desired spatial impression to a listener, who 
is positioned centrally in front of the two loudspeaker 
units spanning ideally 60 degrees with respect to the listener.
When a two-channel stereo recording is listened to
through the left and right loudspeakers arranged in the
above described manner, the listener experiences a 
spatial impression resembling the original sound scen- 
ery. In this spatial impression the listener is able to ob- 
serve the direction of the different sound sources, and 
the listener also acquires a sensation of the distance of 
the different sound sources. In other words, when a two-channel
stereo recording is listened to, the sound sources
seem to be located somewhere in front of the listener 
and inside the area substantially located between the 
left and the right loudspeaker unit. 
[0003] Other audio recording formats are also known, 
which, instead of only two loudspeaker units, rely on the 
use of more than two loudspeaker units for the playback. 
For example, in a four channel stereo system two loud- 
speaker units are positioned in front of the listener: one 
to the left and one to the right, and two other loudspeaker 
units are positioned behind the listener: to the rear left 
and to the rear right, respectively. This makes it possible to create
a more detailed spatial impression of the sound scenery, 
where the sounds can be heard coming not only some- 
where from the area located in front of the listener, but 
also from behind, or directly from the side of the listener. 
Such multichannel playback systems are nowadays 
commonly used for example in movie theatres. Record- 
ings for these multichannel systems can be prepared to 
have independent tracks for each separate channel, or 
the information of the channels in addition to a normal 
two-channel stereo format can also be coded into the 
left and right channel signals in a two-channel stereo 
format recording. In the latter case a special decoder is 
required during the playback to extract the signals for 
example for the rear left and rear right channels. 
[0004] Further, some special methods are known for
preparing recordings which are specially intended to be
listened to through headphones. These include, for example,
binaural recordings, which are made by recording signals
corresponding to the pressure signals that would be
captured by the eardrums of a human listener in a real
listening situation. Such recordings can be made for
example by using a dummy-head, which is an artificial head
equipped with two microphones replacing the two human
ears. When a high-quality binaural recording is listened to
through headphones, the listener experiences the original,
detailed three-dimensional sound image of the recording
situation.

[0005] The present invention is, however, mainly related
to such two-channel stereo recordings, broadcasts
or similar audio material which have been mixed and/or
otherwise prepared to be listened to through two
loudspeaker units positioned in the previously described
manner with respect to the listener. Hereinbelow, the short
term "stereo" refers to the aforementioned kind of
two-channel stereo format, unless otherwise mentioned.
Listening to audio material in such stereo format through
two loudspeakers is hereinbelow referred to as "natural
listening".

[0006] During the last decade portable personal stereo
devices, such as portable tape and CD players, have
become increasingly popular. This development has, among
other things, strongly increased the use of headphones
for listening to music recordings, radio broadcasts etc.
However, the commercially available music recordings and
other audio material are almost exclusively in the
two-channel stereo format, and thus intended for playback
over loudspeakers and not over headphones. Despite this
fact, it is common to the portable stereo devices, and also
to other playback systems, that they do not make any
attempt to compensate for the fact that stereo recordings
are intended for playback over loudspeakers and not over
headphones.
[0007] When a stereo recording is played back over
loudspeakers in a natural listening situation, the sound
emitted from the left loudspeaker is heard not only by
the listener's left ear but also by the right ear, and
correspondingly the sound emitted from the right
loudspeaker is heard both by the right and left ear. This
condition is of primary importance for the generation of a
hearing impression with a correct spatial feeling. In other
words, this is important in order to generate a hearing
impression in which the sounds seem to originate from
a space or stage outside the listener's head. When listening
to a stereo recording over headphones, the left channel is
heard in the left ear only, and the right channel is heard
in the right ear only. This causes the hearing impression to
be both unnatural and tiresome to listen to, and the sound
scenery or stage is contained entirely inside the listener's
head: the sound is not externalised as intended.
[0008] Prior art methods intended for improving the sound
quality of two-channel stereo recordings when presented
over headphones come mainly in the following two types.

[0009] The first type of method is based on the emulation
of a natural listening situation, in which the sound would
normally be reproduced through loudspeakers. In other
words, the stereo signals played back through the
headphones are processed in order to create in the
listener's ears an impression of the sound coming from a
pair of "virtual loudspeakers", and thus further resembling
listening to the real original sound sources. Methods
belonging to this category are referred to later in this
text as "virtual loudspeaker methods".
[0010] The second type of methods is not based on 
attempting to create an accurate natural listening or nat- 
ural sound scenery at all, but they rely on methods such 
as adding reverberation, boosting certain frequencies, 
or boosting simply the channel difference signal (L mi- 
nus R). These methods have been empirically found to 
somewhat improve the hearing impression. Later in this
text, methods belonging to this category are referred to as
"equalizers" or "advanced equalizers".
[0011] In the following, the virtual loudspeaker meth- 
od and the methods based on different types of equal- 
izers are discussed in somewhat more detail. 
[0012] If sound is emitted from a loudspeaker posi- 
tioned for example to the left side of the listener, it is 
possible to determine the sound pressures created at 
the listener's left and right ear. Comparing the loud- 
speaker input signal to the sound pressure signals ob- 
served at the listener's left and right ear, it is possible to 
model the behaviour of the acoustic path that transfers 
the sound to the listener's ears. When this is performed 
separately for both the left and right channels, it is fur- 
ther possible to realize signal filters, which can be used 
to process the loudspeaker input signals according to 
the behaviour of said acoustic paths. By processing the 
original signals using such filters, and playing back the 
filtered signals through headphones, ideally the same
sound pressures are reproduced at the listener's ears
as in the case of listening to the original signals through
loudspeakers. The above described virtual loudspeaker 
method is thus, at least in theory, a scientifically justified 
and credible method to emulate the natural listening 
conditions. 

[0013] Each of the acoustic paths is made up of three
main components: the radiation characteristics of the
sound sources (such as a pair of loudspeakers), the
influence of the acoustic environment (which causes early
reflections from nearby surfaces and late reverberation),
and the presence of the receiver (a human listener) in the
sound field. The loudspeaker is usually not modelled
explicitly; rather it is assumed to have a flat magnitude
response and an omni-directional radiation pattern. The
reflections from the acoustic environment are used by the
listener to form an impression of the surroundings, and by
modelling the early reflections [US 5,371,799;
US 5,502,747; US 5,809,149] and the late reverberation
[US 5,371,799; US 5,502,747; US 5,802,180; US 5,809,149;
US 5,812,674], it is possible to give the listener the
impression of being in an enclosed space. However, when
using the given prior art methods this cannot be achieved
without making a noticeable and negative change to the
overall sound quality.

[0014] The effect of the receiver on the incoming
sound waves, and in particular the effect of the human
head and pinna (outer ear, earlobe), has been studied
intensively by the research community for several decades.
An acoustic path which includes a realistic modelling of
the listener's head, and possibly the listener's torso
and/or pinna, is usually referred to as a head-related
transfer function (HRTF). HRTFs are usually measured on
so-called dummy-heads under anechoic conditions, and it is
common practice to equalize, i.e. to correct the raw
measured data for the response of the transducer chain,
which typically consists of an amplifier, a loudspeaker, a
microphone, and some data acquisition equipment. The HRTF
to the ear closest to the loudspeaker is referred to as the
ipsilateral HRTF, whereas the HRTF to the other ear further
away from the loudspeaker is referred to as the
contralateral HRTF.

[0015] The human auditory system combines and
compares the sounds filtered by the ipsilateral and
contralateral HRTFs for the purpose of localising a source
of sound. It is a generally accepted fact that the auditory
system uses different mechanisms to localise sound
sources at low and high frequencies. At frequencies below
approximately 1 kHz, the acoustical wavelength is
relatively long compared to the size of the listener's
head, and this causes an interaural phase difference to
take place between the sound waves originating from a
sound source (loudspeaker) and arriving at the listener's
two ears. Said interaural phase difference can be
translated into an interaural time difference (ITD), which
in other words is the time delay between the sound arriving
at the listener's closest and furthest ear. For sound
sources in the horizontal plane, a large ITD means that
the source is to the side of the listener whereas a small
ITD means that the source is almost directly in front of,
or directly behind, the listener.
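
The translation from an interaural phase difference to an interaural
time difference mentioned above can be sketched for a single
frequency as follows. The function name and the numerical values
(a quarter-cycle phase difference at 500 Hz) are illustrative
assumptions only and are not taken from the description.

```python
import math

def itd_from_phase(phase_difference_rad, frequency_hz):
    """Time delay in seconds equivalent to a phase difference at one frequency."""
    return phase_difference_rad / (2.0 * math.pi * frequency_hz)

# Assumed example values: a phase difference of pi/2 radians at 500 Hz.
itd_s = itd_from_phase(math.pi / 2.0, 500.0)
print(f"ITD = {itd_s * 1e3:.2f} ms")
```

The relation holds only for one frequency at a time, which is consistent
with the ITD being, strictly speaking, frequency dependent.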

[0016] At frequencies above approximately 2 kHz the
acoustical wavelength is shorter than the human head,
and the head therefore casts an acoustic shadow that
causes an interaural level difference (ILD) to take place
between the sound waves originating from a sound
source and arriving at the listener's two ears. In other
words, the sound pressures arriving at the listener's
closest and furthest ear are different. At frequencies
above 5 kHz, the acoustical wavelength is so short that
the pinna contributes to large variations in the interaural
level difference ILD as a function of both the frequency and
the position of the sound source.

[0017] Thus, localisation of sound sources at low
frequencies is mainly determined by interaural time
difference ITD cues, whereas localisation of sound sources
at high frequencies is mainly determined by interaural
level difference ILD cues.
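
The frequency limits quoted above can be related to the size of the
head with a short calculation. The sketch below is only indicative:
the speed of sound (343 m/s) and the head diameter (0.18 m) are
assumed typical values and do not appear in the description.

```python
SPEED_OF_SOUND_M_S = 343.0   # assumed: air at roughly 20 degrees C
HEAD_DIAMETER_M = 0.18       # assumed typical adult head diameter

def wavelength_m(frequency_hz):
    """Acoustical wavelength in metres at the given frequency."""
    return SPEED_OF_SOUND_M_S / frequency_hz

for f_hz in (500.0, 1000.0, 2000.0, 5000.0):
    lam = wavelength_m(f_hz)
    relation = "longer" if lam > HEAD_DIAMETER_M else "shorter"
    print(f"{f_hz:6.0f} Hz: wavelength {lam:.3f} m "
          f"({relation} than the assumed head diameter)")
```

Under these assumptions the wavelength crosses the head diameter
between 1 kHz and 2 kHz, which matches the region where the text
places the change from ITD cues to ILD cues.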

[0018] Prior art systems that implement the virtual
loudspeaker method over headphones attempt to include
both low-frequency ITD cues and high-frequency ILD cues,
at least to the extent that ILD is not constant above
3 kHz. There are many ways in which this high-frequency
variation can be extracted and implemented
[US 3,970,787; US 5,596,644; US 5,659,619; US 5,802,180;
US 5,809,149; US 5,371,799; and also WO 97/25834]. One
system even exaggerates the ILD in order to achieve a more
convincing spatial effect [EP 0966 179 A2].

[0019] In practice, the drawbacks of the aforemen- 
tioned virtual loudspeaker-type methods concentrate on 
the amount of detail contained in an accurate model of 
the acoustic paths, and further on the difficulties in being 
able to accurately design and realize the necessary sig- 
nal filters. Today such filters can best be realized using 
digital signal processing techniques (DSP). However, 
the dynamic range of the necessary digital filters is rath- 
er large, and this has the undesirable side-effect that the 
filters introduce unwanted colouration of the reproduced 
sound. This colouration of the sound takes place espe- 
cially at the higher frequencies, and it is particularly no- 
ticeable on high-fidelity recordings. 
[0020] Methods that fall into categories of "equalizers" 
or "advanced equalizers" cannot be considered to be 
so-called spatial enhancers in the strict sense of this def- 
inition, since they do not succeed in really externalising 
any part of the sound scenery. The basic idea of boost- 
ing the channel difference signal (L minus R channel) in 
a two-channel stereo format is based on the observation 
that the difference signal seems to contain more spatial 
information than the channel sum signal (L plus R). 
When headphones are used, increasing the level of the
channel difference signal makes the sound sources at the
right and left more audible, whereas the sound sources
near the centre are essentially unaffected. Thus, the
sound components that are at the
extreme left and extreme right on the sound scenery or 
stage are effectively made louder, but spatially they still 
remain at the same locations. However, if the effect 
boosts the overall sound level by a couple of decibels 
when it is switched on, it will sound like an improvement. 
In fact, an increase in the overall sound level will usually
be interpreted by the listener as an improvement in the
quality of the sound, irrespective of the method by 
means of which it was exactly accomplished. Most of 
the "spatializer" or "expander" functions that can be 
found today for example in tape players, CD-players or 
PC sound cards, can be considered a kind of advanced
equalizers affecting the level of the channel difference 
signal [US 4,748,669]. 
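
The behaviour of a difference-signal boost described above can be
illustrated with a minimal sketch. The function name
`boost_difference` and the boost factor are hypothetical and are not
taken from any of the cited documents; the sketch only shows that a
centre source (identical in L and R) is left unchanged while a source
present in one channel only becomes louder.

```python
import numpy as np

def boost_difference(left, right, boost=2.0):
    """Scale the channel difference (L minus R) by `boost`, keep the sum (L plus R)."""
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)
    mid = 0.5 * (left + right)    # channel sum component
    side = 0.5 * (left - right)   # channel difference component
    side = boost * side           # hypothetical boost of the difference only
    return mid + side, mid - side

# Centre source: identical in both channels, so the difference is zero.
centre_l, centre_r = boost_difference([1.0, -1.0], [1.0, -1.0])

# Source in the left channel only: the difference signal is boosted.
hard_l, hard_r = boost_difference([1.0], [0.0])
```

The hard-left source gains level in the left output but also leaks,
with inverted sign, into the right output; nothing in this processing
adds the delay or shadowing cues needed to move the source outside the
head.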

[0021] A known method is also to use a simple low- 
frequency boost, which is an effective method especially 
when used together with headphones. This is because 
headphones are much less efficient in reproducing low 
frequencies than loudspeakers. A low-frequency boost 
helps to restore the spectral frequency balance of the 
recording in playback, but no spatial enhancement can 
be achieved. 

[0022] It is also known that by adding reverberation
to the stereo signals it is possible to give a listener an
impression somewhat similar to the one experienced
when listening to music in a room or other similar closed
space. It is well known that the ratio between direct
sound and reflected, reverberated sound affects the human
sensation of how far away the sound source is experienced
to be. The more reverberation, the farther away
the sound source seems to be. However, high-quality,
high-fidelity recordings already contain the correct
amount of reverberation, and thus adding even more
reverberation will degrade the result, usually giving an
impression that the recording was performed in a basement
or in a bathroom.

[0023] The main purpose of the present invention is
to produce a novel and simple method for converting
two-channel stereo format signals to become suitable
to be played back using headphones. The present invention
is based on a virtual loudspeaker-type approach
and is thus capable of externalising the sounds so that
the listener experiences the sound scenery or stage to
be located outside his/her head in a manner similar to
a natural listening situation. The aforementioned effect
attained by using the method according to the invention
is later in this text referred to as "stereo widening".
[0024] To attain this purpose, the method according
to the invention is primarily characterized in what will be
presented in the characterizing part of the independent
claim 1.

[0025] Furthermore, it is the purpose of this invention
to attain a signal processing device which implements
the method according to the invention. The signal
processing device according to the invention is primarily
characterized in what will be presented in the
characterizing part of the independent claim 7.

[0026] The other dependent claims present some preferred
embodiments of the invention.
[0027] The basic idea behind the present invention is
that it does not rely on detailed modelling of interaural
level difference ILD cues, especially the high-frequency
ILD cues; rather it omits excessive detail in order to
preserve the sound quality. This is achieved by associating
the high frequency ILD with a substantially constant value
(equal for both channels L and R) above a certain
frequency limit f high , and also by associating the low
frequency ILD with another substantially constant
value below a certain frequency limit f low .
[0028] In addition, the invention further sets the
magnitude responses of the ipsilateral and contralateral
HRTFs in such a way that their sum remains substantially
constant as a function of frequency. Hereinbelow
this is referred to as "balancing" and it is different from
prior art methods, including the ones described in
WO 98/20707 and US 5,371,799, which manipulate the
contralateral HRTF only while maintaining a substantially
flat magnitude response of the ipsilateral HRTF over the
entire frequency range.

[0029] The method and device according to the inven- 
tion are significantly more advantageous than prior art 
methods and devices in avoiding/minimizing unwanted
and unpleasant colouration of the reproduced sound in 
the case of high-quality and high-fidelity audio material. 
In addition, the method according to the invention re- 
quires only a modest amount of computational power, 
being thus especially suitable to be implemented in dif- 
ferent types of portable devices. The stereo widening 
effect according to the invention can be implemented 
efficiently by using fixed-point arithmetic digital signal 
processing by a specific filter structure. 
[0030] A considerable advantage of the present invention
is that it does not degrade the excellent sound
quality available today from digital sound sources such as,
for example, Compact Disc players, MiniDisc players,
MP3 players and digital broadcasting techniques. The
processing scheme according to the invention is also 
sufficiently simple to run in real-time on a portable de- 
vice, because it can be implemented at modest compu- 
tational expense using fixed-point arithmetic. 
[0031] When used in connection with the method ac- 
cording to the invention, compared to the sound repro- 
duction via loudspeakers, headphone reproduction has 
the advantage of not depending on the characteristics 
of the acoustical environment, or on the position of the 
listener in that environment. The acoustics of a car cab- 
in, for example, is very different from the acoustics of a 
living room, and the listener's position relative to the 
loudspeakers is also different, and not necessarily ideal 
in these two situations. Headphones, however, sound 
consistently the same regardless of the acoustic envi- 
ronment, and further, if the type and characteristics of 
headphones are known in advance, it is possible to de- 
sign a system which gives good sound reproduction in 
all situations. Furthermore, the capabilities of the mod- 
ern high-quality and high-fidelity digital recording and 
playback facilities back up these possibilities well. 
[0032] The preferred embodiments of the invention 
and their benefits will become more apparent to a per- 
son skilled in the art through the description hereinbe- 
low, and also through the appended claims. 
[0033] In the following, the invention will be described 
in more detail with reference to the appended drawings, 
in which 

Fig. 1  illustrates natural listening to a stereo recording
        played back through two loudspeaker units,

Fig. 2  illustrates the basic idea of the present invention,
        i.e. the use of a balanced stereo widening network,

Fig. 3  shows in more detail the structure of the balanced
        stereo widening network,

Fig. 4a shows a block diagram of a digital filter structure
        used in a preferred embodiment of the balanced
        stereo widening network,

Fig. 4b shows the magnitude response of the digital
        filter structure shown in Fig. 4a,

Fig. 5  illustrates the use of the digital filter structure
        shown in Fig. 4a in implementing the signal
        processing elements emulating a virtual loudspeaker
        to the left of the listener,

Fig. 6  shows a block diagram of the balanced stereo
        widening network using the digital filter
        structure described in Figs 4a and 5 in the
        specific case (G d = 2, G x = 0), and

Fig. 7  illustrates the use of optional pre- and/or
        post-processing in connection with the balanced
        stereo widening network.

[0034] Fig. 1 illustrates a natural listening situation,
where a listener is positioned centrally in front of left and
right loudspeakers L, R. Sound coming from the left
loudspeaker L is heard at both ears and, similarly, sound
coming from the right loudspeaker R is also heard at
both ears. Consequently, there are four acoustic paths
from the two loudspeakers to the two ears. In Fig. 1 the
direct paths are denoted by subscript d (L d and R d ) and
the cross-talk paths by subscript x (L x and R x ). However,
when the loudspeakers L,R are positioned exactly
symmetrically with respect to the listener, the direct path
L d from the left loudspeaker L to the left ear has ideally
the same length and acoustic properties as the direct path
R d from the right loudspeaker R to the right ear, and,
similarly, the cross-talk path L x from the left loudspeaker
L to the right ear has ideally the same length and acoustic
properties as the cross-talk path R x from the right
loudspeaker R to the left ear. Thus, both the direct
(ipsilateral) path and the cross-talk (contralateral) path
can be associated with a frequency-dependent gain, G d and
G x respectively, and a frequency-dependent delay, t and
t+ITD, respectively. The difference between the delays
in the direct path and the cross-talk path corresponds to
the interaural time difference ITD, and the difference
between the gains in the direct path and the cross-talk path
corresponds to the interaural level difference ILD.
[0035] Fig. 2 shows schematically the basic idea of
the present invention. Left and right stereo signals L in ,
R in are processed using a balanced stereo widening
network BSWN, which applies the virtual loudspeaker-type
method with a careful choice of simplified head-related
transfer functions HRTFs, which can be described by the
direct gain G d , the cross-talk gain G x and the interaural
time difference ITD. The aforementioned processing produces
signals L out and R out , respectively, which can be used in
headphone listening in order to create a spatial impression
resembling a natural listening situation, in which the
sound is externalised outside the listener's head.
[0036] Fig. 3 shows in more detail the structure of the
balanced stereo widening network BSWN. The left and right
channel signals L in ,R in are each divided into direct and
cross-talk paths L d ,L x and R d ,R x , respectively. This
creates a total of four paths, which are all filtered
separately using first and second filtering means 1 and 2
for the left direct path L d and the left cross-talk path
L x , respectively, and third and fourth filtering means 3
and 4 for the right direct path R d and the right cross-talk
path R x , respectively. Said filtering means are associated
with gains G d and G x for the direct paths and cross-talk
paths, respectively. Both cross-talk paths L x and R x also
include delay adding means 5 and 6, respectively, for adding
the interaural time difference ITD. Said delay adding means
5 and 6 both have gain equal to one. Left direct path L d is
further summed with the right cross-talk path R x using
combining means 7 to form left channel output signal L out ,
and right direct path R d is correspondingly summed with the
left cross-talk path L x using combining means 8 to form
right channel output signal R out . In addition, network
BSWN includes scaling means 9,10 and 11,12 for scaling each
of the paths L d ,L x and R d ,R x separately.
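
The signal flow of Fig. 3 as described above can be sketched in a few
lines. This is not the filter structure of the preferred embodiment
(Figs 4a to 6); it is a minimal illustration in which the filtering
means are represented by arbitrary FIR impulse responses `h_d` and
`h_x`, the ITD is a whole number of samples, and the function name
`bswn` is hypothetical.

```python
import numpy as np

def bswn(l_in, r_in, h_d, h_x, itd_samples, scale=0.5):
    """Sketch of the four-path network: scale, filter, delay cross-talk, combine.

    h_d, h_x    : assumed FIR impulse responses standing in for the
                  direct (1,3) and cross-talk (2,4) filtering means
    itd_samples : assumed integer delay for the delay adding means 5,6
    scale       : factor applied by the scaling means 9 to 12
    """
    l_in = np.asarray(l_in, dtype=float)
    r_in = np.asarray(r_in, dtype=float)
    if l_in.shape != r_in.shape:
        raise ValueError("left and right inputs must have the same length")

    def direct(x):
        # scaling followed by direct-path filtering
        return np.convolve(scale * x, h_d)

    def cross(x):
        # scaling, cross-talk filtering, then a unity-gain delay (ITD)
        y = np.convolve(scale * x, h_x)
        return np.concatenate([np.zeros(itd_samples), y])

    def combine(a, b):
        # combining means 7,8: zero-pad to a common length and add
        out = np.zeros(max(len(a), len(b)))
        out[:len(a)] += a
        out[:len(b)] += b
        return out

    l_out = combine(direct(l_in), cross(r_in))  # L d + R x
    r_out = combine(direct(r_in), cross(l_in))  # R d + L x
    return l_out, r_out

# Usage: a unit impulse on the left input only, with trivial one-tap filters.
l_out, r_out = bswn([1.0, 0.0, 0.0], [0.0, 0.0, 0.0],
                    h_d=[1.0], h_x=[1.0], itd_samples=2)
```

With these trivial filters the left output carries the scaled impulse
immediately, while the right output carries the same scaled impulse two
samples later, i.e. only the cross-talk delay distinguishes the paths.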

[0037] In order to produce a natural listening impres- 
sion in headphone listening, the properties (G d , G x ) of 
the filtering means 1,2,3,4 and the properties (ITD) of 
the delay adding means 5,6 need to be chosen properly. 
According to the invention, this selection is based on 
natural listening and behaviour of a set of simplified 
HRTFs in such a situation.

[0038] Values for G d and G x can be derived by con- 
sidering the physics of sound propagation. When an ob- 
ject, like the head of a human listener, is positioned in 
an incident sound field, like one produced by two loud- 
speakers in a natural listening situation, the sound field 
is not significantly disturbed by the object if the wave- 
length of the sound waves is long enough compared to 
the size of the object. Given the size of a human head, 
this means that gains G d and G x can be taken to be con- 
stant as a function of frequency, and further substantially 
equal to each other at frequencies lower than approxi- 
mately 1 kHz. At higher frequencies, where the wave- 
lengths of the sound waves become short compared to 
the size of the object, a pressure build-up takes place 
on the side of the object which is towards the source of 
the sound waves, and there will be pressure attenuation 
taking place on the far side of the object. The latter effect 
can be referred to as shadowing. If the object has a relatively
simple shape so that it does not significantly focus the
sound field, and furthermore, if it is substantially rigid, a 
pressure doubling will take place on the near side of the 
object at high frequencies, and no sound waves will 
reach the shadowed zone on the far side of the object. 
[0039] On the basis of the facts mentioned above and
according to the invention, G d and G x can thus be given
a value equal to one at frequencies below a certain lower
frequency limit denoted f low , and G d can be given a
substantially constant value significantly greater than one,
and G x can be given a substantially constant value
significantly less than one, at frequencies above a certain
higher frequency limit f high .

[0040] In an advantageous embodiment of the inven- 
tion G d and G x are set equal to one at frequencies below 
f, ow , and G d is set to 2 and G x is set to zero at frequencies 
5 higher than f hjgh The aforementioned behaviour of the 
gains G d and G x as a function of frequency is schemat- 
ically illustrated in Fig. 3 in graphs inside the blocks
corresponding to the filtering means 1,2 and 3,4. Thus, if
neither G x nor G d varies too rapidly in the transition band
between f low and f high, the total gain of the sum signal
L d + L x, and similarly the total gain of the sum signal
R d + R x, is always very close to 2. In this case one can
ensure that the network BSWN does not affect the total
gain, i.e. amplify the signals, by scaling the direct L d, R d
and cross-talk L x, R x paths each by a factor of 0.5 prior
to filtering. This can be accomplished by scaling the
signals using scaling means 9,10,11,12. To clarify the
aforementioned effect, we can observe the behaviour of
a signal which is connected to input L in. At low
frequencies below f low, said signal passes both filtering means
1 (G d = 1) and 2 (G x = 1), and due to the aforementioned
scaling by 0.5, the sum of the outputs of the filtering
means 1 and 2 has not been amplified with respect to
the original input signal L in. At higher frequencies, the
signal passes only filtering means 1 (G d = 2), and again
due to the scaling by 0.5, the sum of the outputs of the
filtering means 1 and 2 has not been amplified with
respect to the original input signal L in. Consequently, when
a pure sine wave signal is used as input L in, at low
frequencies below f low it is split equally between outputs
L out and R out, and the sum of the amplitudes of the
outputs L out and R out equals the amplitude of the input
L in. At higher frequencies above f high, the signal passes
only through the left channel direct path L d and the
amplitude of the output L out equals the amplitude of the
original input L in. The above described scaling affects
the right channel of the network BSWN in a similar
manner, and it is the reason why the stereo widening network
BSWN according to the invention is referred to as a
balanced network. In yet other words, the sum of the
magnitude responses of the corresponding ipsilateral and
contralateral HRTFs remains constant as a function of
frequency and no net amplification of the signals takes
place.
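
The balance condition described above can be checked numerically. The following is only a minimal sketch: the linear shape of the transition between f low and f high, the example limits of 1 kHz and 2 kHz, and the names `gains`, `F_LOW` and `F_HIGH` are assumptions made for illustration. Only magnitudes are considered; the interaural time difference is ignored.

```python
import numpy as np

F_LOW, F_HIGH = 1000.0, 2000.0  # assumed example frequency limits in Hz


def gains(freq_hz):
    """Return (G_d, G_x) at the given frequencies.

    The transition between F_LOW and F_HIGH is assumed to be linear;
    the description only requires that it is not too rapid.
    """
    f = np.asarray(freq_hz, dtype=float)
    t = np.clip((f - F_LOW) / (F_HIGH - F_LOW), 0.0, 1.0)
    g_d = 1.0 + t  # 1 below F_LOW, 2 above F_HIGH
    g_x = 1.0 - t  # 1 below F_LOW, 0 above F_HIGH
    return g_d, g_x


freqs = np.array([100.0, 1000.0, 1500.0, 2000.0, 8000.0])
g_d, g_x = gains(freqs)

# Each path is scaled by 0.5 before filtering (scaling means 9 to 12).
total = 0.5 * g_d + 0.5 * g_x
print(total)
```

With complementary transitions of this kind the scaled sum stays at unity for every frequency, which is the sense in which the network is balanced.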

[0041] The values of frequency limits f low and f high for
filtering in filtering means 1,2,3,4 are not very critical.
A suitable value for f low can be, for example, 1 kHz, and
for f high 2 kHz. Other values close to these
aforementioned values can also be used, f low, however, being
always somewhat smaller than f high, and the transition
frequency band between the said frequency limits should
also not be made too wide.

[0042] In an advantageous embodiment of the invention,
the low-pass characteristics of second filtering
means 2 (L x) and fourth filtering means 4 (R x) are made
more dramatic than the corresponding effect that it
emulates in the real natural listening situation, i.e. in the
frequency range above f low the corresponding gain G x
is forced to zero. This prevents unwanted comb-filtering
of the monophonic component, i.e. the component
which is common to both L in and R in, at higher
frequencies, which is important so that colouring of the
reproduced sound can be avoided in high-quality, high-fidelity
recordings. Comb filtering of the monophonic
component at low frequencies can be dealt with separately if
desired, for example by applying decorrelation, or by
applying a method whose purpose essentially is to
equalize the monophonic part of the output, either through
addition or convolution.

[0043] Strictly speaking, the interaural time difference 
ITD between the direct path and cross-talk path is also 
frequency dependent, but it can be assumed to be con- 
stant in order to simplify the implementation of the meth- 
od. For sound sources directly in front of the listener the 
value of ITD is zero, and the highest value encountered 
when listening to real sound sources is around 0.7 ms, 
corresponding to the situation where the sound source 
is directly to the side of the listener. The value of ITD 
thus affects the amount of widening perceived by the 
listener. For a desired widening effect the interaural time 
difference ITD can be selected to have a suitable value 
larger than zero but less than 1 ms. A value of 0.8 ms, 
for example, is good for a very high degree of stereo 
widening, but if ITD is selected to be > 1 ms, the result 
becomes very unnatural and therefore uncomfortable to
listen to. The embodiments of the invention are, however,
not limited only to such cases where ITD is given a non- 
frequency dependent constant value. It is also possible 
to use, for example, an allpass filter to vary the value of 
ITD as a function of frequency. 
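
In a digital implementation a constant ITD becomes a delay of a whole number of samples. The helper below is a small illustrative sketch; the function name `itd_to_samples`, the default sampling frequency of 44.1 kHz and the rounding to the nearest sample are assumptions, not part of the description.

```python
def itd_to_samples(itd_ms, sample_rate_hz=44100):
    """Convert a constant ITD in milliseconds to a whole number of samples.

    Values outside the range suggested in the description
    (larger than zero, less than 1 ms) are rejected.
    """
    if not 0.0 < itd_ms < 1.0:
        raise ValueError("ITD should be larger than zero and less than 1 ms")
    return round(itd_ms * sample_rate_hz / 1000.0)


strong_widening = itd_to_samples(0.8)  # the 0.8 ms example value
natural_maximum = itd_to_samples(0.7)  # source directly to the side
print(strong_widening, natural_maximum)
```

At 44.1 kHz the 0.8 ms example corresponds to about 35 samples, and the natural maximum of around 0.7 ms to about 31 samples.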

[0044] Fig. 4a shows a block diagram of a simple digital
filter structure 41, which can be used to efficiently
and advantageously implement the balanced stereo
widening network BSWN in practice. The filter structure
41 takes advantage of the known fact that the output of
a digital linear phase low-pass filter 42 can be modified
so that the result corresponds to the output of another
linear phase digital filter that also passes low
frequencies straight through, i.e. with gain equal to one, but
which has a different magnitude response at higher
frequencies. Thus, a magnitude response of the type
shown in Fig. 4b can be realised from
the output of a digital linear phase low-pass filter 42 with
little additional processing. The additional processing
requires the use of a separate digital delay line 43,
whose length Ip in samples corresponds to the group
delay of the low-pass filter 42. The input digital signal
stream S in is directed similarly and simultaneously to the
inputs of the delay line 43 and the low-pass filter 42. The
output of the delay line 43 is multiplied using
multiplication means 44 by G, where G is the desired
high-frequency magnitude response of the filter structure 41.
The output of the low-pass filter 42 is multiplied by
multiplication means 45 by 1-G. The outputs of the two
parallel branches formed by the low-pass filter 42
connected with multiplication means 45, and the delay line 43
connected with multiplication means 44, are added
together using adding means 46. In practice, the group
delay of the linear phase low-pass filter 42 is of the order
of 0.3 ms, which corresponds to 13 samples at 44.1 kHz
sampling frequency.
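
The filter structure 41 can be sketched in a few lines. This is an illustration under stated assumptions rather than the actual design: the low-pass filter is a 27-tap Hamming-windowed sinc (chosen so that its group delay is 13 samples), the cutoff of 1.5 kHz is arbitrary, and the names `lowpass_fir` and `structure_41` are hypothetical.

```python
import numpy as np


def lowpass_fir(num_taps, cutoff_hz, fs_hz):
    """Odd-length windowed-sinc low-pass filter (linear phase), unity DC gain."""
    n = np.arange(num_taps) - (num_taps - 1) / 2
    h = np.sinc(2.0 * cutoff_hz / fs_hz * n) * np.hamming(num_taps)
    return h / h.sum()


def structure_41(x, h, g):
    """Compute g * delayed(x) + (1 - g) * lowpass(x).

    The delay line length equals the group delay of the linear phase
    filter h, so both branches stay time-aligned.
    """
    x = np.asarray(x, dtype=float)
    group_delay = (len(h) - 1) // 2
    low = np.convolve(x, h)[: len(x)]                          # low-pass filter 42
    delayed = np.concatenate([np.zeros(group_delay), x])[: len(x)]  # delay line 43
    return g * delayed + (1.0 - g) * low                       # means 44, 45, 46


h = lowpass_fir(27, 1500.0, 44100.0)  # 27 taps -> group delay of 13 samples
impulse = np.zeros(64)
impulse[0] = 1.0
response = np.abs(np.fft.rfft(structure_41(impulse, h, 2.0)))
print(response[0], response[-1])  # gain at DC and at the Nyquist frequency
```

At low frequencies both branches carry the same signal, so the gain is G + (1 - G) = 1; at high frequencies only the delay branch contributes and the gain approaches G.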

[0045] Fig. 5 shows schematically how the digital filter
structure 41 shown in Fig. 4a can be used to achieve
computational saving by directing the left channel digital
signal stream L in simultaneously and in parallel into a
single digital linear phase low-pass filter 52 and into a
digital delay line 53. In this way it is possible to
implement the two filters, one for the direct path (first filtering
means 1 in Fig. 3) and another for the cross-talk path
(second filtering means 2 in Fig. 3) so that in addition to
the aforementioned digital low-pass filter 52 and digital
delay line 53, only the use of multiplication means
54,55,56,57 and adding means 58,59 is required. Thus,
Fig. 5 shows the signal processing elements that
emulate a virtual loudspeaker L to the left of the listener and
are responsible for the generation of signal paths L d and
L x. Fig. 5 corresponds substantially to the upper half of
the balanced stereo widening network BSWN shown in
Fig. 3. It is obvious to anyone skilled in the art that the
signal processing elements required to emulate the
virtual loudspeaker R to the right of the listener can be
implemented in a corresponding manner.
[0046] Fig. 6 shows a block diagram of the balanced
stereo widening network BSWN, which is implemented
by using the digital filter structure 41 described above
in Figs 4a and 5, and further corresponds to the specific
case when G d is given a value of 2 and G x is given a
value of zero. In addition, gains G d (means 54), 1-G d
(means 55), G x (means 56), 1-G x (means 57) shown in
Fig. 5 for the left channel have each been scaled in Fig. 6,
for both the left and right channel, by a factor of 0.5 to
balance the overall levels of output signals L out, R out
compared to the levels of the original input signals L in,
R in. In this specific case, and in an advantageous
embodiment of the invention, this reduces
the balanced stereo widening network BSWN into the
simple structure shown in Fig. 6, in which structure the
four filtering means 1,2,3,4 can, in practice, be
implemented by using only two convolutions. Said
convolutions take place in the linear phase low-pass filters 65 and 66,
respectively. The reduced network structure shown in
Fig. 6 is very robust numerically, and thus it is very
suitable for implementation in fixed point arithmetic.
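
One plausible reading of the reduced structure is sketched below; it is derived only from the gains stated above, not from Fig. 6 itself. With G d = 2, G x = 0 and the scaling of 0.5, the direct path becomes the delayed input minus half the low-pass output, and the cross-talk path becomes half the low-pass output delayed by the ITD. The function name `bswn_reduced`, the normalised Hamming window used as a stand-in low-pass filter and the ITD of 35 samples are all assumptions.

```python
import numpy as np


def bswn_reduced(l_in, r_in, h, itd_samples):
    """Reduced widening network using one convolution per input channel."""
    l_in = np.asarray(l_in, dtype=float)
    r_in = np.asarray(r_in, dtype=float)
    n = len(l_in)
    group_delay = (len(h) - 1) // 2

    def delay(x, k):
        return np.concatenate([np.zeros(k), x])[:n]

    low_l = np.convolve(l_in, h)[:n]  # first convolution
    low_r = np.convolve(r_in, h)[:n]  # second convolution

    # Direct path: 0.5 * (2 * delayed - lowpass) = delayed - 0.5 * lowpass.
    # Cross-talk path: 0.5 * lowpass, further delayed by the ITD.
    l_out = delay(l_in, group_delay) - 0.5 * low_l + 0.5 * delay(low_r, itd_samples)
    r_out = delay(r_in, group_delay) - 0.5 * low_r + 0.5 * delay(low_l, itd_samples)
    return l_out, r_out


h = np.hamming(27)
h = h / h.sum()  # crude linear phase low-pass stand-in, unity DC gain

silence = np.zeros(400)
dc_l, dc_r = bswn_reduced(np.ones(400), silence, h, itd_samples=35)
nyq_l, nyq_r = bswn_reduced((-1.0) ** np.arange(400), silence, h, itd_samples=35)
print(dc_l[-1], dc_r[-1], np.max(np.abs(nyq_l[100:])), np.max(np.abs(nyq_r[100:])))
```

A constant input on the left channel settles at 0.5 on both outputs, while an input at the highest representable frequency stays almost entirely in the left output at its original amplitude, matching the behaviour described for the balanced network.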
[0047] The balanced stereo widening network BSWN
according to the invention can be used as a stand-alone
signal processing method, but in practice it is likely that
it will be used together with some kind of pre- and/or
post-processing. Fig. 7 illustrates schematically the use
of some possible pre- and post-processing methods,
which are well known in the art as such,
but which could be used together with the balanced
stereo widening network BSWN in order to further improve
the quality of the listening experience.
[0048] Fig. 7 illustrates the use of decorrelation for
signal pre-processing before the signals enter into the
balanced stereo widening network BSWN. Decorrelation
of the source signals L s and R s guarantees that the
signals L in and R in, which are the input to the balanced
stereo widening network BSWN, always differ to some
degree even if the L s and R s signals from a digital source
are identical. The effect of decorrelation is that the
sound component which is common to both left and right
channels, i.e. monophonic, is not heard as localized in
a single point, but rather it is spread out slightly so that
it is perceived as having a finite size in the sound
scenery. This prevents the sound scenery or stage from
becoming too "crowded" near the centre. In addition, the
decorrelation effectively reduces the attenuation of the
monophonic component in the transition band between
f low and f high caused by the interference between the
direct path and cross-talk path. Decorrelation can be
implemented using two complementary comb-filters as
indicated in Fig. 7. Comb-filters with a common delay of
the order 15 ms are suitable for this purpose. The values
of the coefficients b 0 and b N can be set to, for example,
1.0 and 0.4, respectively. The different sign on b N in the
two channels (in Fig. 7 +b N in the left channel and -b N
in the right channel) ensures that the sum of the
magnitudes of the two transfer functions remains constant
irrespective of the frequency. Consequently, the comb
decorrelation is balanced in a way similar to the
balanced stereo widening network BSWN.
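
The complementary comb-filters can be sketched as follows. This is an illustrative sketch only: the function name `comb_decorrelate` and the delay of 661 samples (about 15 ms at an assumed sampling frequency of 44.1 kHz) are assumptions. The second part evaluates the two transfer functions b 0 + b N z^-N and b 0 - b N z^-N on the unit circle. There the sum of the squared magnitudes is exactly 2(b 0² + b N²), while the plain sum of magnitudes is only approximately constant, staying between 2.0 and about 2.15 for these coefficients.

```python
import numpy as np


def comb_decorrelate(l_s, r_s, delay_samples, b0=1.0, bn=0.4):
    """Apply complementary comb-filters: +bn on the left, -bn on the right."""
    l_s = np.asarray(l_s, dtype=float)
    r_s = np.asarray(r_s, dtype=float)

    def delayed(x):
        return np.concatenate([np.zeros(delay_samples), x])[: len(x)]

    l_in = b0 * l_s + bn * delayed(l_s)
    r_in = b0 * r_s - bn * delayed(r_s)
    return l_in, r_in


# Identical (monophonic) source signals become different after decorrelation.
mono = np.zeros(1000)
mono[0] = 1.0
l_in, r_in = comb_decorrelate(mono, mono, delay_samples=661)

# Transfer functions on the unit circle; phi is the phase of the delay term.
phi = np.linspace(0.0, 2.0 * np.pi, 9)
h_left = 1.0 + 0.4 * np.exp(-1j * phi)
h_right = 1.0 - 0.4 * np.exp(-1j * phi)
power_sum = np.abs(h_left) ** 2 + np.abs(h_right) ** 2
magnitude_sum = np.abs(h_left) + np.abs(h_right)
print(power_sum.min(), power_sum.max(), magnitude_sum.min(), magnitude_sum.max())
```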
[0049] Fig. 7 further illustrates schematically the use 
of equalization, for example low-frequency boost, in or- 
der to compensate for the non-ideal frequency response 
of the headphones. Preferably, equalization that is used 
to restore the spectral frequency balance of the record- 
ing in playback using headphones, is implemented by 
post-processing so that it does not affect the excellent 
dynamic properties of the balanced stereo widening net- 
work BSWN. 

[0050] It is obvious for a person skilled in the art that 
the present invention is not restricted solely to the em- 
bodiments presented above, but it can be freely modi- 
fied within the scope of the appended claims. 
[0051] It is possible to implement the method according
to the invention also by using analog electronics, but
it is obvious for anyone skilled in the art that the pre- 
ferred embodiments are based on digital signal process- 
ing techniques. The digital signal processing structures 
of the balanced stereo widening network BSWN, for ex- 
ample the linear phase low-pass filtering in the cross- 
talk path, can also be realized in many other ways. Dif- 
ferent techniques for this are well documented in litera- 
ture. 

[0052] The method according to the invention is
intended for converting audio material having signals in
the general two-channel stereo format for headphone
listening. This includes all audio material, for example
speech, music or effect sounds, which are recorded
and/or mixed and/or otherwise processed to create two
separate audio channels, which channels can also
further contain monophonic components, or which
channels may have been created from a monophonic
single channel source, for example by decorrelation
methods and/or by adding reverberation. This also
allows the use of the method according to the invention
for improving the spatial impression when listening to different
types of monophonic audio material.
[0053] The media providing the stereo signals for
processing can include, for example, CompactDisc™,
MiniDisc™, MP3 or any other digital media including
public TV, radio or other broadcasting, computers and
also telecommunication devices, such as multimedia
phones. Stereo signals may also be provided as analog
signals, which, prior to the processing in a digital BSWN
network, are first AD-converted.

[0054] The signal processing device according to the
invention can be incorporated into different types of
portable devices, such as portable players or
communication devices, but also into non-portable devices, such as
home stereo systems or personal computers.



Claims 

1. A method for converting two-channel stereo format
left (L) and right (R) channel input signals (L in, R in)
into left and right channel output signals (L out, R out),
in which method

— left direct path (L d) and left cross-talk path (L x)
signals are formed from the left input signal
(L in), and correspondingly

— right direct path (R d) and right cross-talk path
(R x) signals are formed from the right input
signal (R in), and

— the left output signal (L out) is formed by
combining said left direct-path (L d) and said right
cross-talk path (R x) signals, and
correspondingly,

— the right output signal (R out) is formed by
combining said right direct-path (R d) and said left
cross-talk path (L x) signals,

which said left and right channel output signals
(L out, R out) thereby become suitable for headphone
listening, characterized in that

— the direct path signals (L d, R d) each are formed
using filtering (1,3) associated with first
frequency dependent gain (G d),

— the cross-talk path signals (L x, R x) each are
formed using filtering (2,4) associated with
second frequency dependent gain (G x) and by
adding interaural time difference (ITD) (5,6),

— said first and second frequency dependent
gains (G d, G x) are given a common
substantially constant reference value below a first
frequency limit (f low),

— said first frequency dependent gain (G d) is
given a substantially constant value significantly
greater than said reference value, and said
second frequency dependent gain (G x) is given a
substantially constant value significantly less
than said reference value above a second
frequency limit (f high), where

— said second frequency limit (f high) is greater
than said first frequency limit (f low), and

— said interaural time difference (ITD) is given a
frequency independent constant value or
alternatively a frequency dependent value.

2. The method according to claim 1, characterized in
that

— said first and second frequency dependent
gains (G d, G x) are given both a value of one
below said first frequency limit (f low), and

— said first frequency dependent gain (G d) is
given a value of 2, and said second frequency
dependent gain (G x) is given a value of zero above
said second frequency limit (f high).

3. The method according to claims 1 or 2,
characterized in that said direct path signals (L d, R d) both are
scaled by a first scaling factor (S d) and said
cross-talk path signals (L x, R x) both are scaled by a second
scaling factor (S x) in order to make the sum
amplitude of the output signals (L out, R out) to substantially
match the sum amplitude of the input signals (L in,
R in).

4. The method according to claims 2 and 3,
characterized in that the said first and second scaling
factors (S x, S d) both are given a value of 0.5.

5. The method according to any of the foregoing
claims 1 to 4, characterized in that said first
frequency limit (f low) is given a value around 1 kHz and
said second frequency limit (f high) is given a value
around 2 kHz.



6. The method according to any of the foregoing
claims 1 to 5, characterized in that the interaural
time difference (ITD) is given value/values below 1
ms.

7. A signal processing device (BSWN) for converting
two-channel stereo format left (L) and right (R)
channel input signals (L in, R in) into left and right
channel output signals (L out, R out) suitable for
headphone listening, characterized in that the signal
processing device (BSWN) comprises at least

— first filtering means (1) associated with first
frequency dependent gain (G d) to form left direct
path signal (L d) from said left input signal (L in),

— second filtering means (2) associated with
second frequency dependent gain (G x) in serial
with first delay adding means (5) associated
with interaural time difference (ITD) to form left
cross-talk path signal (L x) from said left input
signal (L in),

— third filtering means (3) associated with first
frequency dependent gain (G d) to form right direct
path signal (R d) from said right input signal
(R in),

— fourth filtering means (4) associated with
second frequency dependent gain (G x) in serial
with second delay adding means (6) associated
with interaural time difference (ITD) to form
right cross-talk path signal (R x) from said right
input signal (R in),

— first combining means (7) to form the left output
signal (L out) by combining said left direct-path
(L d) and said right cross-talk path (R x) signals,
and correspondingly,

— second combining means (8) to form the right
output signal (R out) by combining said right
direct-path (R d) and said left cross-talk path (L x)
signals, and

— said first and second frequency dependent
gains (G d, G x) having a common constant
reference value below a first frequency limit (f low),

— said first frequency dependent gain (G d) having
a substantially constant value significantly
greater than said reference value, and said
second frequency dependent gain (G x) having a
substantially constant value significantly less
than said reference value above a second
frequency limit (f high), where

— said second frequency limit (f high) is greater
than said first frequency limit (f low), and

— said interaural time difference (ITD) is having a
frequency independent constant value or
alternatively a frequency dependent value.

8. The signal processing device (BSWN) according to
claim 7, characterized in that

— said first and second frequency dependent
gains (G d, G x) have a value of one below said
first frequency limit (f low), and

— said first frequency dependent gain (G d) has a
value of 2, and said second frequency
dependent gain (G x) has a value of zero above said
second frequency limit (f high).

9. The signal processing device (BSWN) according to
claims 7 or 8, characterized in that the direct paths
(L d, R d) each comprise first scaling means (9,11)
associated with a first scaling factor (S d) and the
cross-talk paths (L x, R x) each comprise second
scaling means (10,12) associated with a second
scaling factor (S x) in order to scale each path to
make the sum amplitude of the output signals (L out,
R out) to substantially match the sum amplitude of
the input signals (L in, R in).

10. The signal processing device (BSWN) according to
claims 8 and 9, characterized in that said first and
second scaling factors (S d, S x) both have a value of
0.5.

11. The signal processing device (BSWN) according to
any of the foregoing claims 7 to 10, characterized
in that said first frequency limit (f low) has a value
around 1 kHz and said second frequency limit (f high)
has a value around 2 kHz.

12. The signal processing device (BSWN) according to
any of the foregoing claims 7 to 11, characterized
in that the interaural time difference (ITD) has
value/values below 1 ms.

13. The signal processing device (BSWN) according to
any of the foregoing claims 7 to 12, characterized
in that the signal processing device (BSWN) is a
digital signal processor and/or digital signal
processing network.

14. The signal processing device (BSWN) according to
claim 13, characterized in that the first (1) and
second (2) filtering means, and correspondingly the
third (3) and fourth (4) filtering means are formed
using a specific digital filter structure (41), in which
filter structure the output of a linear phase low-pass
filter (42;52) is combined with the output of a parallel
digital delay line (43;53) having delay equal to the
group delay of said low-pass filter (42;52).

15. The signal processing device (BSWN) according to
claim 14, characterized in that the first (1), second
(2), third (3) and fourth (4) filtering means are
implemented using reduced network structure (Fig. 6)
based on performing two convolutions.

16. The signal processing device (BSWN) according to
any of the foregoing claims 13 to 15, characterized
in that the input signals (L in, R in) are preprocessed
using a method that performs decorrelation.